I wanted to share some updates. We have been working on making intermittent bugs more "actionable" (and less noisy). Intermittent bugs are created by the code sheriffs while looking at Treeherder. The sheriffs annotate failures with bug numbers, and we post comments to Bugzilla (daily if there are enough failures, or weekly for all failures). After 21 days with no failures, a bot closes the intermittent failure bug. If new failures show up later, we reopen the bug.
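As a rough illustration of the closing rule, here is a minimal Python sketch. The function name `should_close` and the constant are hypothetical, and the sketch assumes that "21 days of no failures" means at least 21 full days since the last annotated failure; the real bot may count differently.

```python
from datetime import date, timedelta

# From the post: bugs are closed after 21 days without failures.
QUIET_PERIOD = timedelta(days=21)


def should_close(last_failure: date, today: date) -> bool:
    """Return True once at least QUIET_PERIOD has passed since the last failure.

    Hypothetical helper for illustration only, not the bot's actual code.
    """
    return today - last_failure >= QUIET_PERIOD
```

For example, a bug whose last failure was on January 1 would be eligible for closing on January 22 under this assumption, and any later failure would reopen it.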
This workflow has been in place for probably 10+ years, and we are finally making some changes to it:
1) Open/reopen bugs on the 3rd failure. Many of the bugs we see are opened and never have another failure, which means the failure does not reproduce in CI. Newly opened bugs now come with more confidence that the failure can be reproduced, and as triage owners and developers you don't have to look at random one-off failures.
2) Bugzilla comments have been changing since the new year. Specifically, we are focusing on which build types and variants are having issues (there is a table posted). This makes it easier to see whether a single variant or an entire platform is the problem.
3) You can now filter by "NEW" failures only: just add `&failure_classification_id=6` to the URL while viewing try (example). This filters the Treeherder view to show only tasks with a NEW failure, defined as a failure whose [sanitized] error message has not been seen on autoland/central in the last 21 days. NEW failures catch all regressions and filter out more than 50% of the intermittents.
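If you build Treeherder links from a script, the filter from item 3 can be added programmatically. The following is a minimal Python sketch. The helper name `with_new_failures_only` is hypothetical, and the example URL (including the `abc123` revision) is a placeholder. Only the `failure_classification_id=6` parameter comes from this post.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Value quoted in the post for the NEW failure filter.
NEW_FAILURE_CLASSIFICATION_ID = "6"


def with_new_failures_only(url: str) -> str:
    """Return `url` with failure_classification_id=6 set in its query string.

    Any existing failure_classification_id parameter is replaced so the
    filter is not duplicated. Hypothetical helper for illustration only.
    """
    parts = urlparse(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "failure_classification_id"
    ]
    query.append(("failure_classification_id", NEW_FAILURE_CLASSIFICATION_ID))
    return urlunparse(parts._replace(query=urlencode(query)))


# Placeholder try push URL; the revision is made up.
print(with_new_failures_only(
    "https://treeherder.mozilla.org/jobs?repo=try&revision=abc123"
))
# prints https://treeherder.mozilla.org/jobs?repo=try&revision=abc123&failure_classification_id=6
```

This only works if the parameters live in the query string of the URL. If your link keeps them somewhere else, appending the parameter by hand as described above is the safer route.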
These 3 changes are all inside the Treeherder code base. Some future work is planned for Bugzilla comments, along with more UI work to make NEW failures easier to toggle and part of push health.
Have feedback? Reach out to @aryx / @jmaher on Matrix in #treeherder.